PROBABILISTIC, STATISTICAL AND ALGORITHMIC ASPECTS
OF THE SIMILARITY OF TEXTS AND APPLICATION TO 
GOSPELS COMPARISON 

GANE SAMB LO ** AND SOUMAILA DEMBELE *


Abstract. The fundamental problem of similarity studies, in the framework of
data mining, is to detect similar items in articles, papers and books of huge
size. In this paper, we are interested in the probabilistic, statistical and
algorithmic aspects of the study of texts. We use the approach of
k-shinglings, a k-shingling being defined as a sequence of k consecutive
characters extracted from a text ($k \geq 1$). The main stake in this field is
to find accurate and quick algorithms that compute the similarity in a short
time. This is achieved by using approximation methods. The first approximation
method is statistical and is based on the theorem of Glivenko-Cantelli. The
second is the banding technique. The third is a modification of the algorithm
proposed by Rajaraman et al. ([1]), denoted here as RUM. The Jaccard index is
the one used in this paper. We finally illustrate the results of the paper on
the four Gospels. The results are very conclusive.


1. Introduction 

In the modern context of open publication, on the Internet in particular,
similarity studies between classes of objects become crucial. For example, such
studies can detect plagiarism of books, articles and other works. They may also
serve as decision and management tools. Another illustration of the importance
of such knowledge concerns commercial firms. They may be interested in
similarity patterns between clients from different sites or between clients who
buy different articles. In the same vein, movie rental companies may try to
measure the extent of similarity between clients subscribing to violent films
and those renting action films, for example.

As a probability concept, the notion of similarity is quite simple. However, in
the context of the Internet, the data may be huge, so the main stake is the
quick determination of some similarity index. The shorter the computation time,
the better. Similarity studies should therefore rely on powerful algorithms that
may give clear indications on similarities in seconds. The contextualization of
the similarity, the forming of the sets to be compared and the similarity
computations may take particular forms according to the domain of application.

In this paper, we focus on the similarity of texts. This leads us to consider
the approach of shinglings, which we define in Section 3.

The reader is referred to Rajaraman et al. ([1]) for a general introduction to
similarity studies. In their book, they provide methods for determining
approximate indices of similarity. They also propose an algorithm that we denote
by RU (for Rajaraman and Ullman). However, to our knowledge, this algorithm has
not yet been investigated in the context of probability theory. Furthermore, an
evaluation of the performance of such algorithms on usual texts may be relevant
to justify such methods.

First, we want to review these methods in a coherent probabilistic and
statistical setting that allows us to reach, later, all the aspects of
similarity in this field. Then we will describe the RU algorithm in detail. We
will point out its redundant sides, from which a modified algorithm, denoted RUM
(for RU modified), will be proposed.

To evaluate the studied techniques, we use the four Gospels as a case study of
similarity. The techniques are compared in terms of speed, computation time,
computing resources and precision.

The obtained results constitute a plea for improving these techniques when dealing 
with larger sizes. 

Regarding the study of the Gospels, our results seem to be conclusive: the four
canonical Gospels are significantly similar.

This paper is organized as follows. In the next section, we define the Jaccard
similarity and its metric and probabilistic approaches. Section 3 is concerned
with the similarity of texts. In Section 4, we discuss the computation stakes of
similarity. In Section 5, we present different methods to estimate the
similarity index. Finally, in Section 6, we deal with applications of the
described methods to the similarity between the four Gospels. We conclude the
paper by giving some perspectives.


2. Similarity of sets 

2.1. Definition. Let $A$ and $B$ be two sets. The Jaccard similarity of the
sets $A$ and $B$, denoted $sim(A, B)$, is the ratio of the size of the
intersection of $A$ and $B$ to the size of the union of $A$ and $B$:

\[
sim(A, B) = \frac{\#(A \cap B)}{\#(A \cup B)}. \tag{2.1}
\]

It is easy to see that for two identical sets, the similarity is 100% and for
two totally disjoint sets, it is 0%.
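As an illustration, formula (2.1) can be sketched in Python as follows. This is
a minimal sketch and not the code used for the computations of this paper; the
function name `jaccard` and the convention chosen for two empty sets are
assumptions made here.

```python
def jaccard(a, b):
    """Jaccard similarity #(A n B) / #(A u B) of two finite collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    print(jaccard({1, 2, 3}, {2, 3, 4}))  # two common elements out of four
```

Identical sets give 1.0 (100%) and disjoint sets give 0.0 (0%), as stated above.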


2.2. Metric approach. Let us consider a non-empty set $S$ and its power set
$\mathcal{P}(S)$. Let us consider the dissimilarity mapping $d$ defined by

\[
\forall (S_1, S_2) \in \mathcal{P}(S)^2, \quad d(S_1, S_2) = 1 - sim(S_1, S_2).
\]

We have this simple result.


Proposition 1. The mapping $d$ is a metric.

Proof. Proving this simple result is not as obvious as one might think. Indeed,
special techniques are required to demonstrate the triangle inequality. This is
done, for example, in ([2]), page 15. Here, we just outline the other conditions
for a metric:

(1) First, let us show that $\forall (S_1, S_2) \in \mathcal{P}(S)^2$,
$d(S_1, S_2) \geq 0$. We have

\[
\#(S_1 \cap S_2) \leq \#(S_1 \cup S_2).
\]

Then

\[
\frac{\#(S_1 \cap S_2)}{\#(S_1 \cup S_2)} \leq 1.
\]

Next

\[
1 - \frac{\#(S_1 \cap S_2)}{\#(S_1 \cup S_2)} \geq 0.
\]

Therefore

\[
d(S_1, S_2) \geq 0.
\]

(2) Let us show that $d(S_1, S_2) = 0 \Longrightarrow S_1 = S_2$. From (2.1), we
get

\[
\#(S_1 \cap S_2) = \#(S_1 \cup S_2) \Longrightarrow S_1 = S_2.
\]

(3) Let us remark that $d(S_1, S_2) = d(S_2, S_1)$, since we have
$S_1 \cap S_2 = S_2 \cap S_1$ and $S_1 \cup S_2 = S_2 \cup S_1$, that is: the
roles of $S_1$ and $S_2$ are symmetrical in what precedes.


So, studying the similarity is equivalent to studying the dissimilarity distance
$d$ between two sets.

2.3. Probabilistic approach. Let us give a probabilistic approach of the
similarity. For that, let us introduce the notion of the representation matrix.
Let $n$ be the size of the set $S$ introduced above.

Let us consider $p$ subsets of $S$: $S_1, \ldots, S_p$. The representation
matrix of $S_1, \ldots, S_p$ is built as follows:

• We form a rectangular array of $p + 1$ columns.

• We put $S, S_1, \ldots, S_p$ in the first row.

• We put in the column of $S$ all the elements of $S$, which we may number from
1 to $n$ in an arbitrary order.

• In the column of each $S_h$, we put 1 or 0 on the row $i$ depending on
whether the $i$-th element of $S$ is in $S_h$ or not. We can then see that, for
$h \neq k$, $\#(S_h \cup S_k)$ is the number of rows for which at least one of
the columns of $S_h$ or $S_k$ has 1 on them, and $\#(S_h \cap S_k)$ is the
number of rows for which the two columns of $S_h$ and $S_k$ both have 1 on
them.






The illustration of the matrix representation is as follows:

Element   S_1   S_2   ...   S_h   ...   S_k   ...   S_p
1         1     0     ...   0     ...   1     ...   1
2         0     0     ...   1     ...   0     ...   0
...
i         1     0     ...   1     ...   1     ...   1
...
n         0     0     ...   0     ...   0     ...   1

Table (2.1)


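Let us denote by $(S_{ih})_{1 \leq i \leq n}$ the column of $S_h$. We obtain

\[
sim(S_h, S_k) = \frac{\#\{i,\ 1 \leq i \leq n,\ S_{ih} = S_{ik} = 1\}}
{\#\{i,\ 1 \leq i \leq n,\ (S_{ih} + S_{ik} = 1) \text{ or } (S_{ih} = S_{ik} = 1)\}}.
\]

This formula can also be written in the following form:

\[
sim(S_h, S_k) = \frac{\#\{i,\ 1 \leq i \leq n,\ S_{ih} + S_{ik} = 2\}}
{\#\{i,\ 1 \leq i \leq n,\ (S_{ih} + S_{ik} = 1) \text{ or } (S_{ih} + S_{ik} = 2)\}}.
\]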
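The second form of the formula lends itself to a direct computation on two 0/1
columns of the representation matrix. The following Python lines are a minimal
sketch under that reading; the function name `sim_from_columns` and the
four-row columns used in the example are hypothetical.

```python
def sim_from_columns(col_h, col_k):
    """Jaccard similarity computed from two 0/1 columns of equal length."""
    if len(col_h) != len(col_k):
        raise ValueError("columns must have the same number of rows")
    # Rows where both columns hold 1 (S_ih + S_ik = 2).
    both = sum(1 for x, y in zip(col_h, col_k) if x + y == 2)
    # Rows where at least one column holds 1 (S_ih + S_ik = 1 or 2).
    at_least_one = sum(1 for x, y in zip(col_h, col_k) if x + y >= 1)
    if at_least_one == 0:
        raise ValueError("both columns are all zeros: similarity undefined")
    return both / at_least_one


if __name__ == "__main__":
    # Hypothetical columns over four rows.
    print(sim_from_columns([0, 1, 1, 0], [1, 0, 1, 0]))
```

Rows holding (0, 0) never enter the computation, which is the observation used
in the theorem that follows.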


In the next theorem, we establish that the similarity is a conditional
probability.


Theorem 1. Let us randomly pick a row $X$ among the $n$ rows. Let $S_{X,h}$ be
the value of the row $X$ for a column $h$, $1 \leq h \leq p$. Then the
similarity between two sets $S_h$ and $S_k$ is the conditional probability of
the event $(S_{X,h} = S_{X,k} = 1)$ given the event
$(S_{X,h} + S_{X,k} \geq 1)$, i.e.

\[
sim(S_k, S_h) = P[(S_{X,h} = S_{X,k} = 1) \mid (S_{X,h} + S_{X,k} \geq 1)].
\]

Proof. We first observe that, for the matrix defined above, the set of rows can
be split into three classes, based on the columns $S_k$ and $S_h$.

1. The rows of type X, which have (1,1) in the two places for the columns $S_k$
and $S_h$.

2. The rows of type Y, which have (1,0) or (0,1) in the two places for the
columns $S_k$ and $S_h$.

3. The rows of type Z, which have (0,0) in the two places for the columns $S_k$
and $S_h$.

Let us show that
$sim(S_k, S_h) = P[(S_{X,h} = S_{X,k} = 1) \mid (S_{X,h} + S_{X,k} \geq 1)]$.

Clearly, the similarity is the ratio of the number of rows of type X to the sum
of the number of rows of type X and the number of rows of type Y. The rows of
type Z are not involved in the similarity between $S_h$ and $S_k$. Thus

\[
sim(S_k, S_h) = \frac{\#\{i,\ 1 \leq i \leq n,\ S_{ih} = 1,\ S_{ik} = 1\}}
{\#\{i,\ 1 \leq i \leq n,\ (S_{ih} + S_{ik} = 1) \text{ or } (S_{ih} = 1,\ S_{ik} = 1)\}}.
\]















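Then, by dividing the numerator and the denominator by $n$, and since the row
$X$ is picked uniformly among the $n$ rows, we have

\[
sim(S_k, S_h)
= \frac{\#\{i,\ 1 \leq i \leq n,\ S_{ih} = 1,\ S_{ik} = 1\} / n}
{\#\{i,\ 1 \leq i \leq n,\ (S_{ih} + S_{ik} = 1) \text{ or } (S_{ih} = 1,\ S_{ik} = 1)\} / n}
= \frac{P(S_{X,h} = S_{X,k} = 1)}{P(S_{X,h} + S_{X,k} \geq 1)}.
\]

Hence we get the result

\[
sim(S_k, S_h) = P[(S_{X,h} = S_{X,k} = 1) \mid (S_{X,h} + S_{X,k} \geq 1)].
\]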


This theorem will be the foundation of statistical estimation of the similarity as a 
probability. 
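To make the estimation idea concrete, here is a hedged Python sketch that
estimates the conditional probability of Theorem 1 by drawing rows at random.
It assumes rows are drawn uniformly with replacement; the function name
`estimate_similarity` and the parameter `n_draws` are hypothetical, and this is
not the sampling scheme used later in the applications.

```python
import random


def estimate_similarity(col_h, col_k, n_draws, rng=None):
    """Estimate P[(1,1) | at least one 1] by sampling rows with replacement."""
    rng = rng or random.Random()
    n = len(col_h)
    hits = 0    # sampled rows holding (1, 1)
    trials = 0  # sampled rows holding at least one 1
    for _ in range(n_draws):
        i = rng.randrange(n)
        s = col_h[i] + col_k[i]
        if s >= 1:
            trials += 1
            if s == 2:
                hits += 1
    if trials == 0:
        return None  # no informative row was drawn
    return hits / trials


if __name__ == "__main__":
    # Hypothetical columns: 200 rows of type (1,1) among 400 non-(0,0) rows.
    col_h = [1] * 300 + [0] * 700
    col_k = [0] * 100 + [1] * 300 + [0] * 600
    print(estimate_similarity(col_h, col_k, 20000, random.Random(0)))
```

In this example the exact similarity is 200/400 = 0.5, and the estimate
fluctuates around that value from one run to another.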

Important remark. When we consider the similarity of two subsets, say $S_h$ and
$S_k$, and we use $S_h \cup S_k$ as the global space, we may see that the
similarity is, indeed, a probability. But when we simultaneously study the joint
similarities of several subsets, say at least $S_h$, $S_k$ and $S_l$, with the
global set $S_h \cup S_k \cup S_l$, the similarity between two subsets is a
conditional probability. Then using the fact that the similarity is a
probability to prove the triangle inequality, as claimed in [1], page 76, is not
justified.

2.4. Expected similarity. Here we shall use the language of urns. Suppose that
we have a reference set of size $n$ that we consider as an urn $U$. We pick at
random a subset $X$ of size $m$ and a subset $Y$ of size $k$. If $m$ and $k$ do
not have the same value, the order in which the sets are picked does have an
impact on our results. We then proceed by first picking at random the first
subset, all at once, and next putting it back in the urn $U$ (the reference
set). Then we pick the other subset. Let us ask ourselves the question: what is
the expected value of the Jaccard similarity?

The answer to this question allows us later to assess the degree of similarity
between the texts. We have the following result:


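Proposition 2. Let $U$ be a set of size $n$. Let us randomly pick two subsets
$X$ and $Y$ of $U$, of respective sizes $m$ and $k$, according to the scheme
described above. We have

\[
P(Card(X \cap Y) = j) =
\begin{cases}
\dfrac{1}{2} \left\{ \dfrac{C_m^j \, C_{n-m}^{k-j}}{C_n^k}
+ \dfrac{C_k^j \, C_{n-k}^{m-j}}{C_n^m} \right\}
& \text{if } 0 \leq j \leq \min(k, m), \\
0 & \text{otherwise.}
\end{cases}
\tag{2.2}
\]

Further

\[
E(sim(X, Y)) = \sum_{j=0}^{\min(k, m)} \frac{j}{2(m + k - j)}
\left\{ \frac{C_m^j \, C_{n-m}^{k-j}}{C_n^k}
+ \frac{C_k^j \, C_{n-k}^{m-j}}{C_n^m} \right\}.
\tag{2.3}
\]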


Proof. Let us use the scheme described above. Let us first pick the set $X$. We
have $L = C_n^m$ possibilities. Let us denote the subsets that $X$ may take by
$X_1, \ldots, X_L$. The searched probability becomes

\[
P(Card(X \cap Y) = j)
= \sum_{s=1}^{L} P((Card(X \cap Y) = j) \cap (X = X_s))
= \sum_{s=1}^{L} P(Card(X \cap Y) = j \mid X = X_s) \, P(X = X_s).
\]

Once $X_s$ is chosen and fixed, we get

\[
P(Card(X \cap Y) = j \mid X = X_s) = \frac{C_m^j \, C_{n-m}^{k-j}}{C_n^k}.
\]

Since $P(X = X_s) = 1 / C_n^m = 1 / L$, we conclude

\[
P(Card(X \cap Y) = j)
= \sum_{s=1}^{L} \frac{C_m^j \, C_{n-m}^{k-j}}{C_n^k} \cdot \frac{1}{L}
= \frac{C_m^j \, C_{n-m}^{k-j}}{C_n^k}.
\]

The result corresponding to picking $Y$ first is obtained by symmetry of the
roles of $k$ and $m$. We then get (2.2). Formula (2.3) follows immediately since

\[
sim(X, Y) = \frac{\#(X \cap Y)}{\#(X \cup Y)}
= \frac{\#(X \cap Y)}{m + k - \#(X \cap Y)}.
\tag{2.4}
\]
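Formulas (2.2) and (2.3) can be evaluated numerically. The following Python
lines are a minimal sketch based on the standard library function `math.comb`;
the function names `intersection_pmf` and `expected_similarity` are
hypothetical, and exact binomial coefficients become slow for very large $n$.

```python
from math import comb


def intersection_pmf(n, m, k, j):
    """P(Card(X n Y) = j) as in (2.2), with #X = m and #Y = k drawn from n."""
    if not (0 <= m <= n and 0 <= k <= n):
        raise ValueError("sizes must satisfy 0 <= m, k <= n")
    if j < 0 or j > min(m, k):
        return 0.0
    x_first = comb(m, j) * comb(n - m, k - j) / comb(n, k)
    y_first = comb(k, j) * comb(n - k, m - j) / comb(n, m)
    return 0.5 * (x_first + y_first)


def expected_similarity(n, m, k):
    """E(sim(X, Y)) as in (2.3); the j = 0 term is zero and is skipped."""
    return sum(j / (m + k - j) * intersection_pmf(n, m, k, j)
               for j in range(1, min(m, k) + 1))


if __name__ == "__main__":
    print(expected_similarity(4, 2, 2))          # small case, equals 7/18
    print(expected_similarity(2000, 1000, 1000))  # close to 1/3
```

For two sets of equal size $m = k$ drawn from a reference set of size
$n = m + k$, the value is close to 1/3.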


3. Similarity of texts

The similarity is an automatic tool to detect plagiarism, abusive quotations,
influences, etc. However, the study of the similarity of texts relies on the
words and not on the meanings.

3.1. Forming of sets for comparison. If we want to compare two texts $S_1$ and
$S_2$, we must transform them into sets of shinglings. For $k > 0$, a
k-shingling is simply a word of $k$ letters. To find the k-shinglings of a
string, we first consider the word of $k$ letters beginning with the first
letter, then the word of $k$ letters beginning with the second letter, then the
one beginning with the third, etc., until the word of $k$ letters ending with
the last letter of the string. So, a string of $n$ letters is transformed into
$(n - k + 1)$ k-shinglings.
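The extraction of k-shinglings described above can be sketched in a few lines
of Python. This is only an illustration; the function name `shinglings` is
hypothetical, and each k-shingling is returned together with its rank (starting
from 1).

```python
def shinglings(text, k):
    """Return the pairs (i, t(i)), i = 1, ..., n - k + 1, for a string of length n."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # The i-th k-shingling starts at the i-th character of the text.
    return [(i + 1, text[i:i + k]) for i in range(len(text) - k + 1)]


if __name__ == "__main__":
    print(shinglings("abcab", 2))
```

A string of 5 letters gives 5 - 2 + 1 = 4 shinglings for $k = 2$, and the value
"ab" appears twice, with two different ranks.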

We observe a serious practical difficulty in using the notion of similarity
defined on sets of k-shinglings. Indeed, when we consider the k-shinglings of a
text, it is very probable that some k-shinglings will be repeated. Then the
collection of k-shinglings cannot define a mathematical set (whose elements are
supposed to be distinct).

But fortunately, a k-shingling is determined by its value and its rank. Suppose
that a text has a length $n$. We can denote the k-shinglings by means of a
vector $t$ of $n - k + 1$ dimensions so that $t(i)$ is the $i$-th k-shingling.
The set of k-shinglings is defined by:

\[
\{(i, t(i)),\ i = 1, \ldots, n - k + 1\}.
\]

With this definition, the k-shinglings are different and do form a well-defined
mathematical set.

3.2. Interpretation of the similarity of texts. Does the similarity between two
texts necessarily have an explanation other than randomness? To answer this
question, let us remark that in any language, a text is composed from an
alphabet that is formed by a finite and even small number of characters. A text
in English is a sequence of lowercase and uppercase letters of the alphabet, of
digits and of some signs such as punctuation marks, apostrophes, etc. This set
does not exceed a hundred characters.

Suppose that the computed similarity between the two sets is $p_0$. From what
point can we reasonably consider that there is a possible collusion between the
authors, that is, either the two texts are based on similar sources, or one
author has used the materials of the other? To answer this question, we have to
know the part due to randomness. As a matter of fact, any text is written from a
limited set of k-shinglings. Then each k-shingling is expected to occur many
times and hence contributes to raising the similarity. Let us consider a set of
size $n = m + \ell$ k-shinglings containing those of the two compared texts. If
the two texts were randomly written, which amounts to saying that they were
written by machines subjected to randomness, the expected similarity, which we
denote by $p_r$, would be given by (2.3). So we can say that the two authors
have some kind of relationship of mutual influence, or that plagiarism is
suspected, if $p_0$ is significantly greater than $p_r$.

It is therefore important to have an idea of the value of $p_r$ for sizes of the
order of those of the studied texts. For example, for the Bible texts that we
study, the text sizes range approximately from 50,000 to 110,000. The values of
$p_r$ for these sizes are around 30%. This knowledge is important to interpret
the results.

3.3. Implementation of the algorithm for computing the similarity of texts. Let
$A$ and $B$ be two texts. Fix $k \geq 1$ and let us consider the two sets of
k-shinglings

\[
\{(i, t_A(i)),\ i = 1, \ldots, n_A - k + 1\}
\]

and

\[
\{(j, t_B(j)),\ j = 1, \ldots, n_B - k + 1\}.
\]

The determination of the similarity between the two texts is achieved by
comparing each k-shingling of $A$ with all the k-shinglings of $B$. We have two
problems to solve.

Suppose that a k-shingling is represented many times in $B$. There is a risk
that the same value of this k-shingling in $A$ is used as many times when
forming the intersection of the sets of k-shinglings. This would be disastrous.

To avoid that, we associate to each k-shingling $(i, t_A(i))$ at most one
k-shingling $(j, t_B(j))$. Let us use the language of weddings by considering
the k-shinglings of $A$ as husbands, the k-shinglings of $B$ as wives, and the
association of a k-shingling of $A$ with a k-shingling of $B$ as a wedding. Our
principle says that a k-shingling of $A$ can marry at most one k-shingling of
$B$. In the same way, a k-shingling of $B$ can be married to at most one
k-shingling of $A$. We are in a case of perfectly symmetric monogamy. How do we
put this into practice in a program?

It suffices to introduce sentinel variables that indicate whether a husband
k-shingling or a wife k-shingling already has a spouse at the moment of the
comparison.

Let us introduce the vectors

\[
(testA(i),\ i = 1, \ldots, n_A - k + 1)
\]

and

\[
(testB(j),\ j = 1, \ldots, n_B - k + 1).
\]

We put $testA(i) = 1$ if the k-shingling $t_A(i)$ already has a wife, and
$testA(i) = 0$ otherwise. We define $testB(j)$ in the same manner. We apply the
following algorithm:

1. Set $sim = 0$.

2. Repeat for $i = 1$ to $n_A - k + 1$:

2a. if $testA(i) = 1$: nothing to do;

2b. else

2b-1. do for $j = 1$ to $n_B - k + 1$:

2b-11. if $testB(j) = 1$: nothing to do;

2b-12. else compare $t_A(i)$ to $t_B(j)$;

2b-13. if equality holds, increment $sim$, put $testA(i) = 1$ and
$testB(j) = 1$, and go to the next $i$;

2b-14. else go to the next $j$.

3. Report the similarity $sim / (n_A + n_B - sim)$.
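The algorithm above can be sketched in Python as follows. This is a minimal
sketch and not the VB6 program used for the computations; the function name
`text_similarity` is hypothetical, and the sizes in the final ratio are counted
in numbers of k-shinglings, which is an assumption made here.

```python
def text_similarity(text_a, text_b, k):
    """Similarity of two texts with at most one match per k-shingling."""
    t_a = [text_a[i:i + k] for i in range(len(text_a) - k + 1)]
    t_b = [text_b[j:j + k] for j in range(len(text_b) - k + 1)]
    test_a = [False] * len(t_a)  # sentinel: husband already married
    test_b = [False] * len(t_b)  # sentinel: wife already married
    sim = 0
    for i, shingling_a in enumerate(t_a):
        if test_a[i]:
            continue  # step 2a, kept to mirror the algorithm
        for j, shingling_b in enumerate(t_b):
            if test_b[j]:
                continue  # step 2b-11
            if shingling_a == shingling_b:
                sim += 1
                test_a[i] = True
                test_b[j] = True
                break  # go to the next i
    denominator = len(t_a) + len(t_b) - sim
    return sim / denominator if denominator else 0.0


if __name__ == "__main__":
    print(text_similarity("aaaa", "aa", 2))
```

In the example, the value "aa" occurs three times in the first text and once in
the second, so only one wedding is counted and the ratio is 1/(3 + 1 - 1). The
count obtained this way equals, for each distinct value, the smaller of its
numbers of occurrences in the two texts.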

4. Computation stakes 

The search for similarity faces many challenges, both in the Web context and on
a local personal computer.


4.1. Limitation of the random access memory (RAM). When we want to compare two
sources of texts, each leading to a large number of shinglings, say $n_1$ and
$n_2$, the direct method loads in memory the vectors $t_A$, $t_B$, $testA$ and
$testB$. When $n_1$ and $n_2$ are very large with respect to the capacities of
the machine, this approach becomes impossible. For example, for values of $n_1$
and $n_2$ of the order of 98,000,000, the declaration of vectors of that size
leads to an overflow in Microsoft VB6.

We are tempted to appeal to another method, which directly uses data from files.
Here is how it works:

(1) open the file of the text A;

(2) read a row of the file A;

(3) open the file B, read all its rows one by one and compare the k-shinglings
of the file B with the k-shinglings of the current row of the file A;

(4) close the file B;

(5) go to the next row of the file A.

This method, which we call the similarity by file, makes practically no use of
the RAM of the computer. We are then facing two competing methods. Each of them
has its qualities and its defects.

4.1.1. The direct method. It loads the vectors of k-shinglings in the RAM. It
leads to quick calculations. However, we risk freezing the machine when the
sizes of the files are huge.

4.1.2. The method of similarity by file. It spares the RAM of the machine.
However, it leads to considerable computation times since, for example, the
second file is opened as many times as the first contains rows. We spare the RAM
but we lose time.

Note that, in the implementation of this method, we always have to carry the
incomplete end of each row over to the next row.

Example 1. Suppose that we compare the 5-shinglings of the first row of A and
the first row of B. The last four letters of a row cannot form a 5-shingling. We
have to use them by adding them at the beginning of the second row of A. These
additional ends are denoted "boutavant1" in the procedures done in (El), when
we implement the similarity by file method. We do the same thing for the rows of
B, which give the "boutavant2".
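The carrying of row ends can be sketched with a Python generator that reads
rows one at a time. This is an illustration only; the function name
`shinglings_from_rows` is hypothetical, and rows are assumed to be concatenated
without any separator once the end-of-line character is removed.

```python
def shinglings_from_rows(rows, k):
    """Yield the k-shinglings of a text given row by row, carrying row ends over."""
    if k < 1:
        raise ValueError("k must be at least 1")
    carry = ""  # incomplete end of the previous row
    for row in rows:
        chunk = carry + row.rstrip("\n")
        for i in range(len(chunk) - k + 1):
            yield chunk[i:i + k]
        # The last k - 1 characters cannot form a k-shingling on their own:
        # they are put at the beginning of the next row.
        carry = chunk[-(k - 1):] if k > 1 else ""


if __name__ == "__main__":
    print(list(shinglings_from_rows(["abc\n", "de\n", "f\n"], 3)))
```

Any iterable of strings can be passed as `rows`, for instance an open text
file, so that only one row and one carried end are held in memory at a time.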


For example, in the work on the Gospel versions, where the numbers of
k-shinglings are of the order of one hundred thousand, the method of similarity
by file takes around thirty minutes and the direct method requires more or less
ten minutes. We divide the computation time by three at the risk of saturating
the RAM.

All that precedes advocates using approximate methods for computing the
similarity. Here, we are going to see three approaches, but we only apply two of
them in the study of the Gospel texts.

5. Approximated computation of Similarity 

5.1. Theorem of Glivenko-Cantelli. Since the similarity is a conditional
probability according to Theorem 1, we can deduce a law of Glivenko-Cantelli
type in the following way.

Theorem 2. Let $p$ be the similarity between two sets of total size $n$. Let us
pick at random two subsets of respective sizes $n_1$ and $n_2$ so that
$n = n_1 + n_2$ and let us consider the random similarity $p_n$ between these
two subsets. Then $p_n$ converges almost surely to $p$ with a speed of
convergence in the order of when $n_1$ and $n_2$ become very large.


That is a direct consequence of the classical theorem of Glivenko-Cantelli. It
yields a useful tool. For example, for the Gospels, for which the similarity is
determined in more or less ten minutes, the random choice of subsets of around
ten thousand k-shinglings for each Gospel gives a computation time of less than
one minute, with an accuracy of 90%. To avoid the instability due to a single
random choice, the average over ten random choices gives a better approximation
of the similarity in more or less one minute. We will come back to this point at
length in the applications.


5.2. Analysis of the banding technique. The banding technique is a
supplementary technique based on the approximation given by the theorem of
Glivenko-Cantelli. Suppose that we divide the representation matrix into $b$
bands of $r$ rows. The similarity can be computed first by considering the
similarity between the different rows of one band, and then between some bands
only. We do not use this approach here.

5.3. Algorithm of RU. It is based on the notion of minhashing to reduce
documents of huge sizes into documents of small sizes called signatures. The
computation of the similarity is done on their compressed versions, i.e., on
their signatures. To better explain this notion, let us consider $m$ subsets of
a huge reference set. Let the matrix be defined as below:

Element   S_1   S_2   ...   S_m
1         1     0     ...   0
2         0     0     ...   1
...
i         1     0     ...   1
...
n         0     0     ...   0

Table (5.1)


The similarity between two sets is directly obtained, as soon as this table is
formed, by using formula (2.1) in a quick way. But setting up this matrix takes
time. This is a serious drawback of the original RU algorithm, which we will
detail soon. For the moment, suppose that the table exists. On this basis, we
are going to introduce the RU algorithm. By this algorithm, we do three things.
First, we consider an arbitrary permutation of the rows. Then, we replace the
column of the rows by a transformation called minhash by means of a congruence
function. Finally, a new table is formed to replace the original table. This new
and shorter one, which we describe immediately below, is called the signature
matrix.

5.3.1. Minhashing signature. Suppose that the elements of $S$ are given in a
certain order, denoted from 1 to $n$. Let us consider $p$ functions $h_i$
($i = 1, \ldots, p$) from $\{1, \ldots, n\}$ to itself of the following form:

\[
h_i(x) = a_i x + b_i \mod n, \tag{5.1}
\]

where $a_i$ and $b_i$ are given integers. We modify this function in the
following way: $h_i(x) = n$ when the remainder of the Euclidean division is
zero. We can then transform the matrix as follows:

Element   S_1   S_2   ...   S_m   h_1      ...   h_p
1         1     0     ...   0     h_1(1)   ...   h_p(1)
2         0     0     ...   1     h_1(2)   ...   h_p(2)
...
i         1     0     ...   1     h_1(i)   ...   h_p(i)
...
n         0     0     ...   0     h_1(n)   ...   h_p(n)

Table (5.2)


The RU algorithm replaces this matrix by another, smaller one called the
minhashing signature, that is:

Hashing   S_1      S_2      ...   S_m
h_1       c_{11}   c_{12}   ...   c_{1m}
h_2       c_{21}   c_{22}   ...   c_{2m}
...
h_p       c_{p1}   c_{p2}   ...   c_{pm}

Table (5.3)

To fill the table above, Rajaraman et al. ([1]), page 65, propose the algorithm
below:


Algorithm of filling of the columns $S_j$:

1. Set all the $c_{rj}$ equal to $\infty$.

2. For each column $S_j$, proceed like this:

2-a. for each element $i$, from 1 to $n$, compute
$h_1(i), h_2(i), \ldots, h_p(i)$;

2-b. if $i$ is not in $S_j$, then do nothing and go to $i + 1$;

2-c. if $i$ is in $S_j$, replace each entry $c_{rj}$, $1 \leq r \leq p$, by the
minimum $\min(c_{rj}, h_r(i))$;

2-d. go to $i + 1$.

3. Go to $j + 1$.

4. End.


At the end of the procedure, each column contains only integers between 1 and
$n$. The similarity computed on this compressed table between $S_1$ and $S_2$,
denoted $sim_{RU}(S_1, S_2)$, is called the approximated similarity RU. It is
supposed to give an accurate approximation of the similarity.
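The filling algorithm above amounts, for each column, to keeping a running
minimum of the hash values over the elements of $S_j$. The Python lines below
are a hedged sketch of this reading. The names `make_hash`, `signature` and
`sim_ru` are hypothetical, and the way two signature columns are compared (the
fraction of rows on which they agree) is the usual minhash estimate, assumed
here since it is not spelled out in the text.

```python
from math import inf


def make_hash(a, b, n):
    """h(x) = a*x + b mod n, with a zero remainder replaced by n, as in (5.1)."""
    def h(x):
        r = (a * x + b) % n
        return n if r == 0 else r
    return h


def signature(elements, hashes):
    """Signature column of a set: minimum of h_r(i) over its elements, for each r."""
    # An empty set keeps the initial value infinity of step 1.
    return [min((h(i) for i in elements), default=inf) for h in hashes]


def sim_ru(sig_1, sig_2):
    """Assumed estimate: fraction of signature rows on which two columns agree."""
    agree = sum(1 for c1, c2 in zip(sig_1, sig_2) if c1 == c2)
    return agree / len(sig_1)


if __name__ == "__main__":
    # Hypothetical coefficients with n = 5 and p = 2 hash functions.
    hashes = [make_hash(1, 1, 5), make_hash(3, 1, 5)]
    print(signature([1, 4], hashes), signature([3], hashes))
```

With these hypothetical coefficients, the element 4 is sent to 5 by the first
function (zero remainder) and to 3 by the second one.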


However, we can simplify this algorithm in a very simple way by stating the
following.

Criterion 1. The transpose of the column $(c_{rj})_{1 \leq r \leq p}$ is the
coordinate-wise minimum of the rows $(h_1(i), \ldots, h_p(i))$, when $i$ covers
the elements of $S_j$.

This simple remark allows us to set up programs in a much easier way.

5.3.2. Algorithm of RU modified (RUM). It is clear that, by forming the matrix
of Table (5.1), the similarity is automatically computed. Indeed, when we
consider the columns $S_i$ and $S_j$, we immediately see that the number of rows
containing the number 1 in these two columns is the size of the intersection.
Then the Jaccard similarity is already found and any further step is useless.
The RU algorithm, on this basis, is not useful. Moreover, forming this matrix
amounts exactly to applying the full method, which requires the comparison of
each couple of shinglings of the two sets. This operation takes about thirty
minutes for sets of size one hundred thousand, for example. Based on this
remark, we propose a modification of the implementation of the RU algorithm in
the following way. Let us consider two sets $S_1$ and $S_2$ with respective
sizes $n_1$ and $n_2$ to be compared. We proceed like this:

1. Form one set $S$ by putting the elements of $S_1$ and then the elements of
$S_2$, keeping the repeated elements. Let $n = n_1 + n_2$.

2. Apply the RU algorithm to this collection by using Criterion 1.

2. Apply the RU algorithm at this collection by using Criterion [T] 


We do not seek to find the intersections. Elements of the intersection are
counted twice here. But it is clear that we still have a zero similarity index
if the two sets $S_1$ and $S_2$ are disjoint, and a 100% index if the sets are
identical.

The question is: how well do the estimations of the similarity given by the RU
or RUM algorithm approximate the true similarity index? We give in this paper an
empirical answer based on the comparison of the Gospels, showing that the RUM
approximation of the similarity is good while running in only a few seconds
instead of thirty minutes (1,800 seconds)!

The exact distribution of the RUM index, which depends on the stochastic laws of
the coefficients $a_i$ and $b_i$ in (5.1), is to be found in a coming paper.


6. Applications of the similarity to the Bible texts

6.1. Textual context of the Gospels. Four versions of the Gospels

Here, we summarize a few important points of background for our analysis of the
Gospels. Throughout this subsection, we refer to [3].

The Gospels (from a Latin word that means good news) are texts that relate the
life and the teaching of Jesus of Nazareth, called Jesus Christ. Four Gospels
were accepted as canonical by the churches: the Gospels according to Matthew,
Mark, Luke and John. The other Gospels, which were not accepted, are called
apocryphal. Numerous Gospels were written in the first century of our era.
Before being set down in writing, the message of Christ was transmitted orally.
From these oral accounts, many texts were composed, among which the four Gospels
that were retained in the Biblical canon. The canonical Gospels are anonymous.
They were traditionally attributed to disciples of Jesus Christ. The Gospel
according to Matthew and the Gospel according to John would have come from
direct witnesses of the preaching of Jesus. Those of Mark and Luke are related
to close disciples.

The first Gospel is the one attributed to Mark. It would have been written
around 70 AD. The Gospel according to Luke follows, around 80-85. The Gospel
according to Matthew is dated between 80 and 90 and, finally, that of John is
dated between 80 and 110. However, these uncertain dates vary according to the
authors who propose chronologies of the evangelical texts. The original Gospels
were written in Greek.

The Gospels according to Matthew, Mark and Luke are called Synoptic. They tell
the story of Jesus in a relatively similar way. The Gospel according to John is
written using another way of telling Jesus' life and mission (christology),
qualified as Johannine. The first Gospel to have been written seems to be
Mark's. According to some researchers, the common parts between the Gospels of
Matthew and Luke may depend on an older text that was lost. This text is
referred to as the Q source.

The source Q, or Document Q, or simply Q (the letter comes from the German word
Quelle, meaning source) is a hypothetical source which some exegetes think would
be at the origin of the common elements of the Gospels of Matthew and Luke.
Those elements are absent from Mark. It would be a collection of sayings of
Jesus of Nazareth that some biblical scholars attempted to reconstitute. This
source is thought to date from around 50 AD.

The Gospels of Matthew and Luke are traditionally held to be influenced by
Mark's Gospel and the Old Testament. But though separately written, they have in
common numerous extracts that do not come from these two sources. This is why
the biblical scholars of the XIXth century generally thought that these facts
suggest the existence of a second common source, called "document Q". Since the
end of the XIXth century, the Logia (i.e. the sayings, in Greek) seem to have
been essentially a collection of speeches of Jesus. Together with the hypothesis
of the priority of the Gospel of Mark, the hypothesis of the existence of the
document Q is part of what the biblical scholars call the two-source hypothesis.

This two-source hypothesis is the most generally accepted solution to the
synoptic problem, which concerns the literary influences between the three
canonical Gospels (Mark, Matthew, Luke) called the Synoptic Gospels. These
influences are perceptible through the similarities in the choice of words and
the order of these words in the statements. The "Synoptic problem" questions the
origin and the nature of these relationships. Under the two-source hypothesis,
not only did Matthew and Luke both draw on the Gospel according to Mark,
independently of one another; but, as we detect similarities between the Gospels
of Matthew and Luke that we cannot find in the Gospel of Mark, we have to
suppose the existence of a second source.

Synoptic Gospels

The Gospels of Matthew, Mark and Luke are considered synoptic Gospels on the
basis of many similarities between them that are not shared by the Gospel of
John. Synoptic means here that they can be seen or read together, indicating the
many parallels that exist among the three.

The Gospel of John, on the contrary, has long been recognized as distinct from
the first three Gospels by the originality of its themes, its content, the
interval of time that it covers, its narrative order and its style. Clement of
Alexandria summarized the unique character of the Gospel of John by saying: John
came last, and was conscious that the terrestrial facts had already been exposed
in the first Gospel. He composed a spiritual Gospel.

Indeed, the fourth Gospel, the Gospel of John, presents a very different picture
of Jesus and his ministry from the synoptics. In differentiating history from
invention, some historians interpret the Gospel accounts skeptically but
generally regard the synoptic Gospels as including significant amounts of
historically reliable information about Jesus. According to some researchers,
the common parts of the Gospels of Matthew and Luke depend on an ancient but
lost document called the source Q.

The synoptic Gospels indeed have many parallels between them: around 
80% of the verses of Mark may be found in the Gospels of Matthew and Luke. As this 
content is in three Gospels, one speaks of the Triple Tradition. The passages of the 
Triple Tradition are essentially narrations, but some speeches of Christ can be found in it. 

On the other hand, we also find numerous identical passages between Matthew and 
Luke that are absent from the Gospel of Mark. Almost 25% of the verses of the Gospel 
according to Matthew find an echo in Luke (but not in Mark). The passages common 
to Matthew and Luke are referred to as the Double Tradition. 

The four Gospels constitute the principal documentary source concerning the life and 
the teaching of Christ. Each of them uses a particular perspective. But all of them 
use the same general scheme and convey the same philosophy. We stop here. For 
further details see [3]. We will attempt to explain the results in our own analysis 
of similarity below. 

6.2. The general setting. All the computations were done in the environment of 
VB6. Once the four texts are chosen, we follow these steps. We first edit the 
files by dropping the words of less than three letters. Then we proceed 
to the computation of the similarity between the different Gospels. 
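
The editing step can be illustrated by a minimal sketch, written in Python rather 
than in the VB6 environment used for the paper. The function names, the 
tokenization into runs of ASCII letters and the sample sentence are assumptions 
made for illustration; how punctuation and accented characters were actually 
handled is not specified here. 

```python
import re


def drop_short_words(text, min_len=3):
    """Keep only the words with at least min_len letters, in their original order.

    Assumption: a "word" is a maximal run of ASCII letters, and the kept words
    are re-joined with single spaces.
    """
    words = re.findall(r"[A-Za-z]+", text)
    return " ".join(w for w in words if len(w) >= min_len)


def count_letters(text):
    """Count alphabetic characters, ignoring spaces and punctuation."""
    return sum(ch.isalpha() for ch in text)


sample = "In the beginning was the Word, and the Word was with God"
edited = drop_short_words(sample)
print(edited)
print(count_letters(sample), count_letters(edited))
```

On the sample sentence only the two-letter word "In" is dropped, so the letter 
count decreases from 44 to 42. 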

Next, we find for each Gospel the number of rows of the file as well as the number 
of letters. 

The first table gives the number of rows and the number of letters before and after editing. 



                                       John      Luke      Mark   Matthew
Number of rows                         2534      3442       628      1319
Number of letters before editing      96269    129548     76543    149747
Number of letters after editing       69316     94766     55555    108722


Table (6.1) 


Now we report the numbers of common k-shinglings, with k = 3, between the 
different Gospels, and then compute the similarity between each pair 
of Gospels by the two exact methods. The results are in Tables 6.2 and 6.3. 

Table: Computation of the similarity by the direct method between the 
different Gospels (kc: number of common k-shinglings) 


          Luke               Mark               Matthew
John      Sim = 57.62%       Sim = 57.53%       Sim = 51%
          kc = 59981         kc = 45600         kc = 60134
          time = 755 s       time = 816 s       time = 510 s

Luke                         Sim = 54.12%       Sim = 69.55%
                             kc = 52782         kc = 83468
                             time = 640 s       time = 1430 s

Mark                                            Sim = 48.74%
                                                kc = 53827
                                                time = 508 s


Table (6.2) 


Table: Computation of the similarity by the method by file between the 
different Gospels (kc: number of common k-shinglings) 


          Luke               Mark               Matthew
John      Sim = 57.62%       Sim = 57.53%       Sim = 51%
          kc = 59981         kc = 45600         kc = 60134
          time = 2312 s      time = 1376 s      time = 3021 s

Luke                         Sim = 54.12%       Sim = 69.55%
                             kc = 52782         kc = 83468
                             time = 3080 s      time = 1457 s

Mark                                            Sim = 48.74%
                                                kc = 53827
                                                time = 1552 s


Table (6.3) 


Approximated similarity 

In this part, the computation of the similarity is done by the direct method. 
We randomly pick 10000 k-shinglings from the first file and 10000 k-shinglings from 
the second file. We remark that the computation time of the similarity is 
around 20 seconds. We get approximated values of the similarities between the Gospels. 
We use two methods of computation: the approximation using the theorem of 
Glivenko-Cantelli, and a double approximation that combines it with the 
RUM algorithm. The results are given in the tables that follow: 


Table: Computation of the approximated similarity by the theorem of 
Glivenko-Cantelli between the different Gospels 


          Luke               Mark               Matthew
John      Sim = 47.50%       Sim = 46.46%       Sim = 46.04%
          time = 20 s        time = 34 s        time = 29 s

Luke                         Sim = 50.79%       Sim = 50.26%
                             time = 19 s        time = 22 s

Mark                                            Sim = 52.28%
                                                time = 27 s


Table (6.4) 


Table: Computation of the approximated similarity by the RUM 
algorithm between the different Gospels 

This approach is remarkable, since a very low number pp of hash functions 
already gives good approximations. To guarantee the stability of the results, 
we report the average of the results obtained over BB=50 repetitions of the 
experiment, and the standard deviation of this sequence of results. 

Case for pp=5 and BB=50. 



          Luke                 Mark                 Matthew
John      Sim = 60%            Sim = 58%            Sim = 59.2%
          Std. dev. = 20.76    Std. dev. = 22.1     Std. dev. = 20.38
          time = 26 s          time = 25 s          time = 22 s

Luke                           Sim = 56.4%          Sim = 63.2%
                               Std. dev. = 19.87    Std. dev. = 22.03
                               time = 22 s          time = 22 s

Mark                                                Sim = 59.2%
                                                    Std. dev. = 22.38
                                                    time = 27 s


Table (6.5) 


Table: Case of computation of the approximated similarity by the RUM 
algorithm between the different Gospels 

Case for pp=10 and BB=50. 



          Luke                 Mark                 Matthew
John      Sim = 62%            Sim = 61.6%          Sim = 63.6%
          Std. dev. = 16.68    Std. dev. = 15.91    Std. dev. = 14.52
          time = 26 s          time = 25 s          time = 22 s

Luke                           Sim = 62%            Sim = 58%
                               Std. dev. = 14.56    Std. dev. = 13.41
                               time = 31 s          time = 27 s

Mark                                                Sim = 63.8%
                                                    Std. dev. = 14.54
                                                    time = 29 s


Table (6.6) 


Table: Case of computation of the approximated similarity by the RUM 
algorithm between the different Gospels 

Case for pp=15 and BB=50. 



          Luke                 Mark                 Matthew
John      Sim = 57%            Sim = 59.33%         Sim = 58.26%
          Std. dev. = 13.45    Std. dev. = 11.33    Std. dev. = 13.84
          time = 28 s          time = 28 s          time = 30 s

Luke                           Sim = 60.13%         Sim = 58.26%
                               Std. dev. = 13.23    Std. dev. = 13.51
                               time = 30 s          time = 28 s

Mark                                                Sim = 57.2%
                                                    Std. dev. = 14.17
                                                    time = 29 s


Table (6.7) 


Table: Case of computation of the approximated similarity by the RUM 
algorithm between the different Gospels 

Case for pp= 20 and BB=50. 



          Luke                 Mark                 Matthew
John      Sim = 57.6%          Sim = 60.8%          Sim = 62.9%
          Std. dev. = 10.63    Std. dev. = 10.11    Std. dev. = 10.63
          time = 32 s          time = 31 s          time = 31 s

Luke                           Sim = 57.9%          Sim = 63.6%
                               Std. dev. = 9.59     Std. dev. = 9.22
                               time = 32 s          time = 31 s

Mark                                                Sim = 60.7%
                                                    Std. dev. = 8.94
                                                    time = 31 s


Table (6.8) 


6.3. Analysis of results. 

6.3.1. Evaluation of algorithms. 

Algorithm on the similarity by the direct method 

In this algorithm, we first form the sets of k-shinglings for each text. Then we 
compute the similarity between them. 
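
A minimal Python sketch of this direct method is given here for illustration. It 
assumes the set-based Jaccard index (size of the intersection over size of the 
union) on character k-shinglings; the function names are hypothetical, and 
whether repeated k-shinglings are counted once or with multiplicity in the 
reported experiments is not stated in this section. 

```python
def shingles(text, k=3):
    """Return the set of all sequences of k consecutive characters of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


a = shingles("abcde")  # {"abc", "bcd", "cde"}
b = shingles("bcdef")  # {"bcd", "cde", "def"}
print(jaccard(a, b))   # 2 common k-shinglings out of 4 distinct ones: 0.5
```

The cost of this computation grows with the sizes of the two sets, which is 
what motivates the approximation methods evaluated next. 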

We remark that the time needed to determine the similarity between two 
Gospels is of the order of ten minutes or more (from 508 s to 1430 s in Table (6.2)). 
The similarities are around 50% to 60%, with a maximum of 69.55% for Luke and Matthew. 


Algorithm on the similarity by the method by file 

Here we remark that the computation times are greater than those of the direct 
method, in most cases by a wide margin. They are of the order of 30 minutes. 
We naturally have the same similarities as those given by the direct method. 

Algorithm on the similarity by the theorem of Glivenko-Cantelli 

We randomly pick a number NG = 10000 of k-shinglings from each of the two files, and 
next we compute the similarity as we did in the case of the direct method. 
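
The sampling step may be sketched as follows, under stated assumptions: the 
k-shinglings are drawn uniformly without replacement among all positions of 
each text, and the set-based Jaccard index is then computed on the two 
samples. The name `sampled_similarity` and its parameters are hypothetical; the 
exact sampling scheme of the experiments is not detailed in this section. 

```python
import random


def shingle_list(text, k=3):
    """All k-shinglings of text, one per starting position (repeats kept)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def sampled_similarity(text_a, text_b, n=10000, k=3, seed=None):
    """Jaccard index computed on random samples of at most n k-shinglings per text."""
    rng = random.Random(seed)
    all_a = shingle_list(text_a, k)
    all_b = shingle_list(text_b, k)
    # Sample without replacement; take everything if the text has fewer than n.
    a = set(rng.sample(all_a, min(n, len(all_a))))
    b = set(rng.sample(all_b, min(n, len(all_b))))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# With n larger than the number of k-shinglings, the exact value is recovered.
print(sampled_similarity("abcde", "bcdef", n=10, seed=1))  # 0.5
```

When n is smaller than the number of available k-shinglings, the result 
depends on the random draw, so different seeds give slightly different values. 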

We remark a considerable reduction of the computation time: the similarity 
indices are obtained in less than a minute. The similarity is again 
around 50%. 

Algorithm on the similarity by RUM 

We randomly pick n_1 = 10000 k-shinglings from the first file and n_2 = 10000 
k-shinglings from the second file. We apply the RUM algorithm with a number of 
hash functions pp taking the values 5, 10, 15, 20. To guarantee the stability of 
the results, the RUM method is used fifty times (BB=50) and the average similarity 
is reported in Tables (6.5), (6.6), (6.7) and (6.8). 
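
For illustration only, the repetition scheme can be sketched with the standard 
MinHash estimator, which is not the RUM modification itself (its details are 
not reproduced in this section). The hash family h(x) = (c x + d) mod P, the 
use of CRC32 to turn k-shinglings into integers, and all function names are 
assumptions of this sketch. 

```python
import random
import statistics
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime used as modulus of the hash family


def minhash_estimate(a, b, pp, rng):
    """Fraction of pp random hash functions whose minimum agrees on a and b.

    a and b must be non-empty collections of strings (k-shinglings).
    """
    xa = [zlib.crc32(s.encode("utf-8")) for s in a]
    xb = [zlib.crc32(s.encode("utf-8")) for s in b]
    agree = 0
    for _ in range(pp):
        c = rng.randrange(1, PRIME)
        d = rng.randrange(0, PRIME)
        min_a = min((c * x + d) % PRIME for x in xa)
        min_b = min((c * x + d) % PRIME for x in xb)
        if min_a == min_b:
            agree += 1
    return agree / pp


def repeated_estimate(a, b, pp=20, bb=50, seed=0):
    """Mean and standard deviation of bb independent MinHash estimates."""
    rng = random.Random(seed)
    runs = [minhash_estimate(a, b, pp, rng) for _ in range(bb)]
    return statistics.mean(runs), statistics.stdev(runs)


a = {"abc", "bcd", "cde", "def"}
b = {"bcd", "cde", "def", "efg"}
mean, std = repeated_estimate(a, b, pp=20, bb=50)
print(mean, std)
```

Each single run only takes values that are multiples of 1/pp, which is why 
averaging over BB runs is useful when pp is small. 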

Finally, we arrive at a striking result: by using subsamples of the two sets and by 
using the approximation method via the RUM algorithm, we get an acceptable 
estimation of the similarity in a few seconds. But since the results may be 
biased, performing the process a certain number of times and reporting the average 
is better. 


We may study the variability of the results. If we proceed BB = 50 times with 
pp = 20 hash functions, the values obtained for the similarities present 
an empirical standard deviation of the order of 10%. The standard error of their 
average is then about 10%/√50 ≈ 1.4%, which means that the reported value is 
accurate to within about 2%. 
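
As a rough check, the empirical deviations in Tables (6.5) to (6.8) can be 
compared with the binomial formula that holds for the standard MinHash 
estimator, where each of the pp hash functions agrees with probability J equal 
to the true similarity. Applying this formula to RUM, and taking J = 0.6 as a 
typical value, are assumptions of this sketch. 

```python
import math


def minhash_run_std(j, pp):
    """Standard deviation of one estimate built from pp independent
    Bernoulli(j) agreements: sqrt(j * (1 - j) / pp)."""
    return math.sqrt(j * (1 - j) / pp)


for pp in (5, 10, 15, 20):
    # Expressed in percentage points, as in the tables.
    print(pp, round(100 * minhash_run_std(0.6, pp), 1))
# prints 21.9, 15.5, 12.6 and 11.0 for pp = 5, 10, 15 and 20
```

These values are of the same order as the standard deviations reported in the 
tables, which decrease from about 20 for pp = 5 to about 10 for pp = 20. 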

For the Gospels for example, we finally conclude that the true value of the 
similarity lies in an interval centered at the approximated value given by the RUM 
method, with magnitude 10%. This result, which is achieved in only seconds, is very 
significant for large sets. 

We may also appreciate the power of this algorithm, which allows the estimation of the 
similarity of sets of around one hundred thousand (100,000) characters in only 6 
seconds. 

6.3.2. Comparison of Gospels. From Tables (6.5), (6.6), (6.7) and (6.8), we 
notice that the Gospels of Luke and Matthew have the greatest similarity, around 70%. 
From what we already said in Subsection 6.1, Luke and Matthew have used 
the Gospel of Mark and, in addition, are based on the unknown source Q. Likewise, the 
similarity between the Gospel of John and the others might be explained by the fact 
that the Gospel of John was the last to be released, in about year 100 or year 110 of 
our era. Its author might already have been aware of the contents of the other three Gospels. 

We might hope for a similarity around 90%. But many factors can influence 
the outcomes. The Gospels were written by four different persons, and each 
of them may use his own words. Besides, we used translated versions. This latter 
fact can result in a significant decrease of the similarity relative to its true 
value. Another point is that a limited alphabet is used, which produces a 
structural part in the similarity. For example, for the considered sizes, this part is 
around 30%. 

With sets of these sizes, the automatic and stochastic similarity is 
of the order of 30%. Since the similarities between the Gospels are around 50%, we 
conclude that the Gospels really have a significant similarity. Taking into account the 
remarks made above, we may expect that the true similarities are 
much greater. This is in favor of the hypothesis of the existence of a 
common source, which can be named the source Q. 

6.3.3. Recommendations and perspectives. To conclude, we recommend the 
following steps in assessing similarity: 

1. Determine the automatic and stochastic part of the similarity, by simulation 
studies using formula (2.4). 

2. Form the sets of k-shinglings of the two studied sets. 

3. Pick at random n_1 and n_2 k-shinglings from the two sets to study. 

4. Apply the RUM algorithm. 

5. Compare the similarity found with the result of step (1). 

6. Conclude on a significant similarity if the similarity reached is widely 
superior to the stochastic similarity determined in (1). Otherwise the similarity 
is not accepted. 

7. Apply the RUM algorithm a number of times before drawing a definitive 
conclusion. 
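
Steps 1, 5 and 6 above can be sketched as follows. This is only an illustration 
under stated assumptions: the stochastic part is estimated by simulating 
uniformly random texts over a limited alphabet and using the set-based Jaccard 
index, not by formula (2.4), which is not reproduced here; the function names 
and the `margin` threshold standing for "widely superior" are hypothetical. 

```python
import random


def shingles(text, k=3):
    """Set of all sequences of k consecutive characters of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def baseline_similarity(n1, n2, alphabet="abcdefghijklmnopqrstuvwxyz",
                        k=3, reps=20, seed=0):
    """Average similarity of two random texts of n1 and n2 characters (step 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        t1 = "".join(rng.choice(alphabet) for _ in range(n1))
        t2 = "".join(rng.choice(alphabet) for _ in range(n2))
        total += jaccard(shingles(t1, k), shingles(t2, k))
    return total / reps


def is_significant(estimate, baseline, margin=0.10):
    """Steps 5 and 6: accept only if the estimate clearly exceeds the baseline."""
    return estimate > baseline + margin


base = baseline_similarity(2000, 2000)
print(base, is_significant(0.50, base))
```

The baseline depends strongly on the text sizes and on the alphabet, so it 
should be simulated for sizes comparable to those of the texts under study. 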

6.3.4. Conclusion. In this paper we described the main methods of determination 
of the similarity. We empirically estimated the incompressible stochastic 
similarity between two texts. We proposed a modification of the RU algorithm, named 
RUM, and we applied it to subsamples of the studied texts. The combination of the 
Glivenko-Cantelli theorem and an empirical study of the RUM algorithm leads to 
the conclusion that the approximated similarity given by this procedure 
is a good estimation of the true similarity. Since this approximated similarity is 
computed in seconds, the method shows remarkable performance. Hence it is 
recommended for the study of similarity for very large data sets. 

We applied our methods to the four Gospels. The obtained results concern the 
study of the Gospels themselves as well as the evaluation of the different methods of 
computation of the similarity. In conclusion, the Gospels have indices of similarity 
of around 50% or more. 

In a coming paper we will concentrate on the theoretical foundations of the RUM 
algorithm in the setting of Probability theory and Statistics. 